In this paper, we propose a new multi-objective contextual multi-armed bandit problem with two objectives, where one objective dominates the other. Unlike single-objective bandit problems, in which the learner obtains a random scalar reward for each arm it selects, in the proposed problem the learner obtains a random reward vector, where each component of the reward vector corresponds to one of the objectives and the distribution of the reward depends on the context provided to the learner at the beginning of each round. We call this problem the contextual multi-armed bandit with a dominant objective (CMAB-DO). In CMAB-DO, the goal of the learner is to maximize its total reward in the non-dominant objective while ensuring that it maximizes its total reward in the dominant objective. In this case, the optimal arm for a given context is the one that maximizes the expected reward in the non-dominant objective among all arms that maximize the expected reward in the dominant objective. First, we show that the optimal arm lies in the Pareto front. Then, we propose the multi-objective contextual multi-armed bandit algorithm (MOC-MAB) and define two performance measures: the 2-dimensional (2D) regret and the Pareto regret. We show that both the 2D regret and the Pareto regret of MOC-MAB are sublinear in the number of rounds. We also compare the performance of the proposed algorithm with other state-of-the-art methods on synthetic and real-world datasets. The proposed model and algorithm have a wide range of real-world applications that involve multiple and possibly conflicting objectives, ranging from wireless communication to medical diagnosis and recommender systems.
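The lexicographic notion of optimality described above (maximize the non-dominant objective among the arms that maximize the dominant one) can be sketched as follows. This is a minimal illustration, not the paper's algorithm; the function name, the two-column reward layout, and the tie tolerance `tol` are assumptions introduced here for clarity.

```python
import numpy as np

def lexicographic_optimal_arm(expected_rewards, tol=1e-9):
    """Select the optimal arm under a dominant objective.

    expected_rewards: (K, 2) array of per-arm expected rewards;
    column 0 is the dominant objective, column 1 the non-dominant
    one. Among all arms whose dominant-objective reward is within
    `tol` of the maximum, return the index of the arm with the
    highest non-dominant reward (an illustrative tolerance, not
    part of the original formulation).
    """
    r = np.asarray(expected_rewards, dtype=float)
    best_dominant = r[:, 0].max()
    candidates = np.flatnonzero(r[:, 0] >= best_dominant - tol)
    return int(candidates[np.argmax(r[candidates, 1])])

# Three arms: arms 0 and 2 tie in the dominant objective,
# but arm 2 is better in the non-dominant one.
rewards = [[0.9, 0.1], [0.5, 0.8], [0.9, 0.4]]
print(lexicographic_optimal_arm(rewards))  # → 2
```

Note that the selected arm is Pareto optimal by construction: no other arm can improve one objective without losing in the other, which matches the claim that the optimal arm lies in the Pareto front.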